Towards real-time community detection in large networks

The recent boom of large-scale Online Social Networks (OSNs) both enables and necessitates the use of parallelisable and scalable computational techniques for their analysis. We examine the problem of real-time community detection and a recently proposed linear-time label propagation, or "epidemic", community detection algorithm, which runs in O(m) on a network with m edges. We identify characteristics and drawbacks of the algorithm and extend it by incorporating different heuristics to facilitate reliable and multifunctional real-time community detection. With limited computational resources, we employ the algorithm on OSN data with 1 million nodes and about 58 million directed edges. Experiments and benchmarks reveal that the extended algorithm is not only faster, but its community detection accuracy also compares favourably with popular modularity-gain optimization algorithms, which are known to suffer from resolution limits.

I. INTRODUCTION

Recent years have seen the flourishing of numerous On- 
line Social Networks (OSNs). Cyber communities such 
as Facebook, MySpace and Orkut, where users can keep 
in touch with friends on the Internet, have all emerged 
as top 10 sites globally in terms of traffic. Tools and 
algorithms to understand the network structures have 
consequently emerged as popular research topics. By 
their nature, OSNs contain an immense number of per- 
son nodes which are sparsely connected. Edges are often 
bidirectional since a mutual agreement is required before such friendship links are established. One of the most notable phenomena in such networks is their resemblance to the so-called six degrees of separation [1], where on average every person is related to another random person via 5 other people in the real world. This has
indeed been shown in real life communities and, much 
more conveniently, on online communities [23| . Networks 
which exhibit such small degrees of separation while be- 
ing sparsely connected are famously known as Small- 
World Networks 0. 

Well-established online communities often contain tens of millions of users connected by billions of edges, which both enables and necessitates the use of parallelisable and scalable computational techniques for their analysis. In this paper, we examine the problem of network community detection. Graphically, such communities are characterized by a group of nodes which are densely connected by internal edges but less so towards the outside of the communities, as depicted by the densely connected subgraphs in Fig. 1. Understanding the community structure and dynamics of networks is vital for the design of related applications and for devising business strategies, and may even have direct implications on the design of the
networks themselves 01 ■ 

FIG. 1: Snapshot of a subgraph of an OSN (500 nodes).

We empirically analyse a recently proposed community detection technique by label propagation discussed in [4], which is summarised as follows. Each node in a network is first given a unique label. In every iteration, each node is updated by choosing the label that most of its neighbours have (the maximal label). If there happen to be multiple maximal labels (which is typical in the beginning), one label is picked randomly. Previous results have shown that this algorithm is extremely efficient in uncovering accurate community structure. As an example, we apply the algorithm to a set of OSN connection data crawled by Mislove et al. [3], comprising 3 million nodes connected by roughly 0.2 billion directed links.
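The update rule summarised above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `label_propagation`, the adjacency-dictionary input, the `max_iters` cap and the rule that a node keeps its current label whenever that label is already maximal are all assumptions made here so that the sketch terminates cleanly.

```python
import random
from collections import Counter

def label_propagation(adj, max_iters=100, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj maps each node to a list of its neighbours.
    Returns (labels, number_of_passes_performed).
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts with a unique label
    nodes = list(adj)
    for it in range(1, max_iters + 1):
        rng.shuffle(nodes)                # visit nodes in a fresh random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            maximal = sorted(l for l, c in counts.items() if c == best)
            if labels[v] in maximal:
                continue                  # assumed rule: stay put if already maximal
            labels[v] = rng.choice(maximal)   # ties are broken at random
            changed = True
        if not changed:                   # a full pass without changes: converged
            return labels, it
    return labels, max_iters
```

For example, on two disconnected triangles, `label_propagation({0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]})` assigns one label to each triangle; which label survives in each depends on the random order.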

We give a survey of related work in the next section and look further into the characteristics of the algorithm in Section III. We discuss the potential implementations, improvements and applications of the algorithm on different types of networks (Section IV). Section V gives detailed comparisons between the label propagation algorithm (LPA) and fast modularity-optimization algorithms. We conclude the paper with future directions of research in Section VI.



II. RELATED WORK 

Community detection in complex networks has at- 
tracted ample attention in recent years. Apart from 
OSNs, researchers have engaged in community analysis in 
various types of networks. In the case of the Internet, ex- 
amples of communities are found in autonomous systems 
and indeed web pages of similar topic @ . In biologi- 
cal networks, it is widely believed that modular structure 
plays a crucial role in biological functions [3]. Related 
literatures such as [1, [l^l may serve as introductory 
reading, which also include methodological overviews and 
comparative studies of different algorithms. 

The detection of community structure in a network is generally intended as a procedure for mapping the network into a tree known as a dendrogram. In this tree, the leaves are the nodes and the branches join them or (at a higher level) groups of them, thus identifying a hierarchy of communities. Nodes can either be agglomerated successively starting from single nodes (agglomerative), or the whole network can be recursively partitioned (divisive). Newman and Girvan introduced a seminal divisive
algorithm in which the selection of the edge to be cut 
is based on the value of its edge betweenness [l^, the 
number of shortest paths between all node pairs running 
through it. It is clear that when a graph is made of 
tightly bound clusters, each loosely interconnected, all 
shortest paths between nodes in different clusters have 
to go through the few inter-cluster connections, which 
therefore have a large betweenness value. Recursively 
removing these large betweenness edges would partition 
the network into communities of different sizes. 

Quantitatively, however, we need a metric to measure 
how well the community detection is progressing, other- 
wise most algorithms would either continue until every 
node is split into a single community or all join together 
into one. Newman and Girvan proposed in [l^ a mea- 
sure of the goodness of communities called modularity, 
for the set of uncovered communities C, the modularity is defined to be:

$$Q = \sum_{c \in C} \left[ \frac{I_c}{E} - \left( \frac{2 I_c + O_c}{2E} \right)^2 \right] \qquad (1)$$

where $I_c$ indicates the total number of internal edges that have both ends in c, $O_c$ is the number of outgoing edges that have only one end in c and E is the total number of edges. This measure essentially compares the number of links inside a given module with the expected value for a randomized graph of the same size and same degree sequence.
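To make the roles of the three quantities concrete, the following sketch computes modularity for an undirected, unweighted graph using the standard Newman-Girvan form, in which each community c contributes I_c/E minus ((2 I_c + O_c)/(2E)) squared. The function name `modularity` and the edge-list and dictionary inputs are assumptions chosen for illustration, and the graph is assumed to contain at least one edge.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community id.
    """
    internal = defaultdict(int)   # I_c: edges with both ends in c
    outgoing = defaultdict(int)   # O_c: edges with exactly one end in c
    total = 0                     # E: total number of edges
    for u, v in edges:
        total += 1
        cu, cv = community[u], community[v]
        if cu == cv:
            internal[cu] += 1
        else:
            outgoing[cu] += 1
            outgoing[cv] += 1
    q = 0.0
    for c in set(community.values()):
        degree_sum = 2 * internal[c] + outgoing[c]   # total degree of nodes in c
        q += internal[c] / total - (degree_sum / (2 * total)) ** 2
    return q
```

For two triangles joined by a single edge (E = 7), splitting the graph into the two triangles gives I_c = 3 and O_c = 1 for each community, so Q = 2(3/7 - 1/4) = 5/14, whereas placing every node in one community gives Q = 0.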



The concept of modularity has gained such popular- 
ity that it has not only been used as a measure of 
the community partitioning of a network but also as 
a key fitness indicator in various community detection 
algorithms. The algorithm proposed by Clauset, Newman and Moore (CNM) [13], which greedily combines nodes/communities to optimize modularity gain, is perhaps to date one of the most popular algorithms for detecting communities in relatively large-scale networks. At the time CNM was proposed, it was the only algorithm capable of community detection on networks of size 500,000 in a matter of hours. Throughout the years, several variations of the CNM have been proposed
[1^ . [lil . [iBl . Most of them concentrate on more efficient 
data structures as well as modularity gain heuristics to 
improve the overall performance. A recent adaptation [16] that treats newly combined communities as a single node after each iteration is able to identify community structure on a network containing 1 billion edges in a matter of hours.

It is vital, however, to understand that modularity 
is not a scale-invariant measure and hence, by blindly 
relying on its maximization, detection of communities 
smaller than a certain size is impossible. This is famously known as the resolution limit [17] of modularity-based algorithms. Since LPA does not involve modularity optimization, its community detection capability is scale-independent and therefore not affected by the resolution limit, as will be shown in Section V.



III. DISCUSSION 

Here, we give a brief discussion on the characteristics 
of the algorithm as well as some preliminary results ap- 
plying the algorithm on the OSN described above. 

A. A "near linear time" algorithm 

One can consider the label spreading as a simplified but specific case of epidemic spreading where all individuals are considered infectious with their own unique disease. Each person is infected by a disease that is prevalent in his or her neighbourhood. Fig. 2 depicts the labelling convergence seen in a 4-clique. The number of clusters decreases monotonically with each iteration as certain labels become extinct due to domination by other labels. Apart from certain rare and exceptional cases, the labelling self-organises efficiently to an unsupervised equilibrium.

FIG. 2: Each node is looked at in a certain order and a new label is selected. The above shows how nodes in a 4-clique self-organise into one single community in one iteration.

As suggested in [4], certain properties may prevent the equilibrium from occurring. For instance, a network with a bipartite structure might cause the system to oscillate if the algorithm is run synchronously, i.e., all nodes are updated together only after they have selected their maximal labels. Running the algorithm asynchronously in a randomized order every iteration, as suggested in the paper, may give less definitive results but solves the problem. It was also suggested that a node that has two equally maximal labels to choose from may fail to converge, and that an extra stopping criterion to prevent the switching of labels would have to be in place. We note, however, that in our implementation including the concerned node's own label in the maximal label consideration effectively avoids all the above non-convergent behaviours and removes the requirement for an extra stopping criterion.
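The oscillation on bipartite structures can be reproduced on a toy example. The sketch below is illustrative only: the helper names `synchronous_step` and `asynchronous_step`, the deterministic tie-break (smallest label) and the starting state in which each side of the graph already shares a label are assumptions chosen to make the behaviour reproducible, and every node is assumed to have at least one neighbour.

```python
from collections import Counter

def synchronous_step(adj, labels):
    """One synchronous LPA step: all nodes read the old labels, then update together."""
    new_labels = {}
    for v, neighbours in adj.items():
        counts = Counter(labels[u] for u in neighbours)
        best = max(counts.values())
        # Deterministic tie-break (smallest label) so the example is reproducible.
        new_labels[v] = min(l for l, c in counts.items() if c == best)
    return new_labels

def asynchronous_step(adj, labels, order):
    """One asynchronous LPA pass: each node sees updates made earlier in the same pass."""
    labels = dict(labels)
    for v in order:
        counts = Counter(labels[u] for u in adj[v])
        best = max(counts.values())
        labels[v] = min(l for l, c in counts.items() if c == best)
    return labels

# Complete bipartite graph K_{2,2}: nodes 0, 1 on one side and 2, 3 on the other.
adj = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}
start = {0: "a", 1: "a", 2: "b", 3: "b"}

after_one = synchronous_step(adj, start)       # the two sides swap labels
after_two = synchronous_step(adj, after_one)   # and swap back again
settled = asynchronous_step(adj, start, order=[0, 1, 2, 3])
```

Under synchronous updates the two sides exchange labels on every step and never settle, whereas a single asynchronous pass leaves all four nodes with the same label.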

In one iteration, each node's neighbours are examined and the maximal label is chosen. The running time of this algorithm is therefore O(knd), where k is the number of iterations, n the number of nodes and d the average degree of nodes. Note that nd can also be described by m, the number of edges. The number of iterations required, k, depends on the stopping criterion but is not very well understood. [1] suggested that the number of iterations required is independent of the number of nodes and that after 5 iterations, 95% of their nodes are already accurately clustered.

Since labels can hardly affect nodes outside their local densely connected substructures, the convergent behaviour should depend on these substructures rather than on the whole network. This is confirmed by preliminary testing and directs us to look at substructures which can ultimately become the community. Experiments show that the average numbers of iterations required for the labelling to converge (no change in labels) in an N-clique for the asynchronous and synchronous implementations are 2.1 and 3.6 respectively, largely independent of N. To further investigate the average convergent behaviour on a substructure, we look at Fig. 3, which summarises the relationship between the number of iterations required before convergence, k, and the pairwise connectivity, p, that controls the edge density in a random graph of size N (where p = 1 corresponds to the N-clique).

In both implementations, we see that k remains fairly constant over both N and p until p reaches a certain threshold, beyond which we begin to see an inverse dependence between N and k. The overall averages for the asynchronous and synchronous implementations in this case are 2.8 and 5.2.

FIG. 3: The above plots show the number of iterations required before convergence for both the synchronous and asynchronous implementations on a random graph of size N with probability of pairwise connection p. All values here are averaged over 100 realisations.

Let us, however, consider another simple but non-random topology. Suppose we start off with an N-clique; at the j-th construction step, the graph is grown by connecting the N - j most recently joined nodes to the new node (cf. Fig. 4).




FIG. 4: This substructure is constructed on an N-clique, N = 25, by attaching each new node, labelled l, N < l < 2N, to existing nodes l - 1, ..., 2(l - N); it thus contains 49 (2N - 1) nodes and 600 (N(N - 1)) edges.
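The construction in the caption of Fig. 4 can be checked directly. The sketch below assumes nodes are numbered from 1, with nodes 1 to N forming the initial clique; the function name `grown_clique` and the edge-set representation are illustrative choices, not taken from the original implementation.

```python
def grown_clique(n):
    """Build the grown-clique substructure as a set of undirected edges.

    Nodes 1..n form a clique; each new node l (n < l < 2n) is attached to
    the existing nodes 2(l - n), ..., l - 1.
    """
    edges = set()
    for u in range(1, n + 1):
        for v in range(u + 1, n + 1):
            edges.add((u, v))                 # initial n-clique
    for l in range(n + 1, 2 * n):
        for v in range(2 * (l - n), l):       # the n - (l - n) most recent nodes
            edges.add((v, l))
    return edges

edges = grown_clique(25)
nodes = {v for e in edges for v in e}
```

With N = 25 the clique contributes N(N - 1)/2 = 300 edges and the attachment steps contribute another 300, which matches the 49 nodes and 600 edges quoted in the caption.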

These structures by construction will converge into a single community under LPA. Without worrying about how abundant such patterns are in real-world communities, we look at the convergent behaviour shown in Fig. 5. The trend clearly reveals that k grows logarithmically with respect to N. We therefore suggest a possible worst case for k of the order of O(log N), where N is the size of the largest substructure with a topology similar to the above. Indeed, we anticipate real-world social networks to contain highly heterogeneous substructures which may be intricately connected and affect each other's convergence. We thus consider the understanding of the convergent behaviour in large complex networks such as OSNs as a direction for further investigation.

FIG. 5: The relationships between the number of iterations required before convergence, k, of both implementations and the size, N, of the aforementioned structure. All values here are averaged over 100 realisations.

B. Community Detection in OSN 

We carry out community detection on the aforementioned OSN using a desktop PC with 4 GB of RAM and a 2.4 GHz quad-core processor running a 32-bit Java VM 1.6. Due to limited memory, we restrict the number of nodes to the first million. Since the order of nodes in the original data corresponds to that of a breadth-first web crawl, this way of "cutting off" the data is equivalent to extracting a snowball sample. As discussed in Q , snowball methods are known to over-sample high-degree nodes, under-sample low-degree ones and overestimate the average node degree. This is seen in the higher average degree of the subgraph, 250, compared to 106 for the original graph. Nonetheless, since the purpose of this paper is to evaluate the algorithm on large-scale networks, the sampled network satisfies our requirements. The sampled subgraph contains 1,000,000 nodes and 58,793,458 directed links. Convergent behaviours of the two different implementations are shown in Fig. 6.

A crucial point is that in a complex network as large as this, the so-called "convergence" does not necessarily yield an optimal result in terms of modularity. For example, we see that the asynchronous implementation took on average merely 5 iterations to achieve maximum modularity but has highly volatile results in different runs, as depicted by the shaded area in the figure. On the other hand, the synchronous implementation achieved maximum modularity much more slowly than the asynchronous version but its performance on average is much more stable (its performance range is thus omitted). It is equally important to understand and utilise the performances of these two different implementations. Further discussions on the implications of these implementations and their utilization are given in Section IV.

FIG. 6: Average performances of asynchronous and synchronous LPA. Values are averaged over 5 runs. The shaded area denotes the range of the performances of the asynchronous implementation.

Each single-threaded iteration finishes in a matter of tens of seconds and thus, depending on the stopping criterion, it can take as little as 8 to 10 minutes to reach peak performance. Extrapolating the time required with respect to the number of edges, the algorithm without any optimization should be able to detect communities on a graph with 1 billion edges in less than 180 minutes, of a magnitude similar to that in [16].
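As a rough check of this extrapolation, assuming (as the text does) that the running time scales linearly with the number of edges, and taking the 8 to 10 minutes quoted above for the 58,793,458-link subgraph as the baseline:

$$\frac{10^{9}}{58{,}793{,}458} \approx 17.0, \qquad 8\ \text{min} \times 17.0 \approx 136\ \text{min}, \qquad 10\ \text{min} \times 17.0 \approx 170\ \text{min},$$

both of which fall below the stated 180 minutes.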

Fig. 7 shows the distribution of community/cluster sizes collected from a specific run of the asynchronous version of the algorithm when the modularity peaked at 0.638. The size distribution of communities within the OSN follows a 2-part power-law distribution in the complementary CDF with an estimated coefficient of 1.1.
The interested reader is referred to [l^, [11] for discus- 
sions on the characteristics of different networks. 



IV. A MORE RELIABLE AND EFFICIENT 
ALGORITHM 

In this section, we discuss potential modifications to 
the algorithm to increase its reliability, functionality and 
computational efficiency. 

A. Hop Attenuation & Node Preference

Due to the "epidemic" nature of the algorithm, a major limitation is that a certain "label epidemic" may manage to "plague" a large number of nodes. To be exact, in some runs a certain community of size over 500,000 (50% of the number of nodes) is formed, as opposed to the thousand or so other communities, whose sizes are on average of the order of hundreds, greatly contributing to the modularity drop after the peak. We conjecture that this is partially due to the asynchronous nature of the algorithm and the initial formation of communities, where certain communities do not form links strong enough to prevent a foreign "epidemic" from sweeping through. Further experiments confirm that the synchronous version of the algorithm slows down the formation of such "monster" communities but does not prevent them.

FIG. 7: The community-size distribution of communities uncovered by the algorithm, which follows a 2-part power law.

We propose an extension to this algorithm by adding a score associated with the label which decreases as it traverses from its origin. A node is initially given a score of 1.0 for its label. After a node i has collected from its neighbourhood, $N_i$, all the respective labels and scores, the calculation of the new maximal label, $\ell_i'$, can be generalised by:

$$\ell_i' = \operatorname*{argmax}_{\ell} \sum_{i' \in N_i(\ell)} s_{i'}(\ell)\, f(i')^{m}\, w_{i'i} \qquad (2)$$

where $\ell_i$ is the label of node i, $s_i(\ell)$ is the hop score of label $\ell$ in i, $w_{i'i}$ is the weight of the edge between i' and i (we sum the weights in both directions if the graph is directed) and f(i) is any arbitrary comparable characteristic for any node i. For instance, if we define f(i) = Deg(i), then when m > 0 more preference is given to nodes with more neighbours; when m < 0, less. The final step is to assign a new attenuated score $s_i'$ to the new label $\ell_i'$ of i by subtracting the hop attenuation δ, 0 < δ < 1:

$$s_i'(\ell_i') = \left( \max_{i' \in N_i(\ell_i')} s_{i'}(\ell_i') \right) - \delta \qquad (3)$$

where $N_i(\ell)$ is the set of neighbours of i that have label $\ell$. The value δ governs how far a particular label can spread as a function of the geodesic distance from its origin. This additional parameter adds extra uncertainties to the algorithm but may encourage a stronger local community to form before a large cluster starts to dominate. Ideally, the selection of δ can even be adaptive to the current iteration number, the neighbourhood of the node concerned and perhaps some a priori network parameters. We investigate the use of varying δ in the next section and assume here a constant value for δ. Note that this setting may induce a negative feedback loop; we therefore let δ = 0 if the selected label is equal to the current label.

As discussed, modularity has been widely used in the literature as a metric to contrast the community detection capabilities of different algorithms on real-world networks. Whilst high modularity indicates a significantly modularised structure compared with a randomised graph of the network concerned, the correspondence between high modularity and accurately partitioned communities is not well understood due to the resolution limit of modularity. Here we attempt to contrast the behaviours of the algorithms on the OSN based on modularity but shall not draw strong conclusions on the accuracies of the community detection for the above reasons. In Section V, a novel benchmark proposed by Lancichinetti et al. [19], capable of revealing the resolution limit of modularity-based algorithms, is used for further comparisons.

Fig. 8 depicts the average performance curves over 5 runs for both versions of the algorithm applying hop attenuation and preferential linkage. The results suggest that, on both implementations, a slight but not too high preference for high-degree nodes (m > 0) can speed up the process of achieving peak modularity on the OSN but also gives rise to a steeper drop, as shown in Fig. 8(a). We believe, however, that different magnitudes of m simply restrict the choice of nodes to different subsets, some of which may contribute to a "global pandemic" and some of which may not. Simply using the degree of a node may not be a heuristic generic enough for different networks. Further study is required to understand, if at all possible, how to deduce a generic preference on neighbourhood labels every iteration without resorting to a global metric, which is costly. Nonetheless, we show that giving preference to certain nodes over others when deciding between labels to accept can be beneficial in terms of the number of iterations needed to achieve maximum modularity.
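The score-weighted label choice and the hop attenuation step described in this subsection can be sketched as a single node update. This is a hedged illustration under several assumptions: the function name `update_node`, the dictionary-based state, taking f(i) to be the degree of i, treating missing edge weights as 1.0 and breaking ties at random are all choices made here, and the handling of labels whose score has decayed to zero or below is deliberately left out.

```python
import random
from collections import defaultdict

def update_node(i, adj, labels, scores, delta=0.1, m=0.1, weight=None, rng=random):
    """Update node i in place; return True if its label changed.

    adj: node -> list of neighbours; labels: node -> label;
    scores: node -> hop score of the label the node currently holds.
    weight: optional dict (u, v) -> edge weight; missing entries count as 1.0.
    f(i) is assumed to be the degree of node i.
    """
    weight = weight or {}
    totals = defaultdict(float)   # label -> sum of s * f^m * w over neighbours holding it
    best_score = {}               # label -> highest hop score among those neighbours
    for j in adj[i]:
        l = labels[j]
        w = weight.get((j, i), 1.0)
        totals[l] += scores[j] * (len(adj[j]) ** m) * w
        best_score[l] = max(best_score.get(l, float("-inf")), scores[j])
    if not totals:
        return False              # isolated node: nothing to collect
    top = max(totals.values())
    candidates = [l for l, t in totals.items() if t == top]
    new_label = rng.choice(candidates)
    changed = new_label != labels[i]
    # No attenuation is applied when the node keeps its current label.
    step = delta if changed else 0.0
    labels[i] = new_label
    scores[i] = best_score[new_label] - step
    return changed
```

With m = 0 the degree term drops out and only the hop scores and edge weights decide the winning label; raising m shifts the choice towards labels held by high-degree neighbours.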

Looking at hop attenuation, we find that the application of δ indeed deters the occurrence of the "monster clusters" as expected, thereby preventing the modularity drop after certain iterations. But it was also obvious that high hop attenuation prevented the healthy growth of the communities and restricted the increase in modularity (cf. Fig. 8(b), 8(e)). Moreover, we conjecture that hop attenuation restrains the spread of the label from an arbitrary center and thereby favours the formation of circular clusters. This suppression of non-circular clusters may lead to the suboptimal performance in terms of modularity, as shown in the asynchronous case (Fig. 8(e)). Finally, from Fig. 8(c) and 8(f) we see that combining both parameters, on average, benefits both versions of the algorithm in achieving a community partitioning of high modularity more efficiently and consistently.

FIG. 8: Average performance comparisons of the synchronous and asynchronous implementations with varying δ and m over 5 runs.



B. Hierarchical & Overlapping Communities

Communities in certain networks are known to be hierarchical. For instance, students in the same classes often form strong local communities, while these communities, say of the same school, in turn form a larger but relatively weaker community. As discussed in Section II, most CNM-based algorithms are inherently hierarchical since communities are agglomerated by greedy local optimization of modularity gain.

We present two simple modifications to the original method to enable the detection of hierarchical communities. Firstly, let us consider the application of hop attenuation on label propagation. Suppose we impose a very high hop attenuation at the beginning; we then expect communities of small diameter to form. If we then gradually relax the attenuation value, we should expect these small communities to merge into larger ones. In order to achieve this, we modify Eq. (3) as follows:

$$s_i'(\ell_i') = 1 - \delta\big(d_G(O(\ell_i'), i)\big) \qquad (4)$$

where

$$d_G(O(\ell), i) = 1 + \min_{i' \in N_i(\ell)} d_G(O(\ell), i'). \qquad (5)$$

Essentially, instead of receiving the current hop scores from the neighbourhood and carrying out a subtraction, the score is now determined by the actual geodesic distance ($d_G$) from the label ℓ's origin, denoted by O(ℓ), and the function δ. This gives greater flexibility to δ in terms of geodesic distances and can facilitate iteration-dependent hop attenuation as required here, with slight extra computation cost.
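A sketch of the distance-based score may help fix ideas. It assumes that each node records its distance from the origin of the label it holds, that a node adopting a label takes one plus the smallest such distance among the neighbours holding that label, and that the score has the form 1 minus δ evaluated at that distance. The helper names `linear_delta` and `distance_score`, and the linear shape of δ, are hypothetical choices made for illustration.

```python
def linear_delta(rate):
    """Return an attenuation function delta(d) = rate * d, capped at 1.0.

    The linear form is an assumption; lowering `rate` between iterations
    relaxes the attenuation so that small communities can merge.
    """
    return lambda d: min(1.0, rate * d)

def distance_score(i, new_label, adj, labels, dist, delta_fn):
    """Return (distance, score) for node i adopting new_label.

    dist[j] is the recorded distance of neighbour j from the origin of the
    label it holds; at least one neighbour of i must hold new_label.
    """
    holders = [j for j in adj[i] if labels[j] == new_label]
    d = 1 + min(dist[j] for j in holders)   # one hop further than the closest holder
    return d, 1.0 - delta_fn(d)
```

For instance, a node adjacent to the origin of a label (distance 0) obtains distance 1 and, with `linear_delta(0.5)`, a score of 0.5; with `linear_delta(0.0)` the attenuation vanishes and the score stays at 1.0.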

Our second proposal is inspired by [16], where we can similarly treat newly combined communities as a single node, and use the number of inter-community edges as the weight of edges between these "freshly condensed" nodes. Instead of doing this every iteration, we can apply a certain amount of hop attenuation, or a hard limit on the diameter of the community, and do this after an equilibrium is reached.

Fig. 9 gives an illustration of the first modification applied to a subgraph of the OSN. Note that this modification depends very much on the initial labelling of nodes because it determines the initial centers of these small communities.




FIG. 9: (Color online) Community detection in the OSN (n = 3000) by gradually decreasing hop attenuation (δ = 0.5 at the top with Q = 0.64, δ = 0 at the bottom with Q = 0.78). Nodes with 3 or fewer neighbours are filtered out to ease the visualisation.



Another important question, which was also briefly addressed in [3], is the problem of overlapping communities [20], i.e., nodes can often be considered members of different communities. From previous sections, we understand that the asynchronous version of the algorithm is capable of generating very different results in different runs. This is exactly what [4] suggested as a potential solution: to re-run the algorithm several times. In a parallel environment, however, the results tend to fluctuate much less. An initial attempt was to increase the number of labels passed each time between nodes to achieve a similar effect. Preliminary experiments indicate limited success since this setting hampers the convergence process, possibly due to the potential of latent labels switching back and forth in the system. Another possibility is to exploit the fact that nodes on the border of a community have different proportions (purity) of neighbours from other communities. We can potentially use that as a measure of membership, but this may indeed only be applicable to such boundary nodes.



C. Optimization 

FIG. 10: The difference in % modularity and speed of the optimized modifications with the original.

The individual inspection of every node, particularly those with many neighbours, is a crucial factor in determining the speed of the algorithm. Putting aside efficient data structures and prudent programming, an obvious optimization we can make without much compromise on performance is to selectively update high-degree nodes. The reader may have realised that, after certain iterations, it would be pointless to update certain nodes that are well inside a cluster. These nodes are surrounded by nodes with the same label, which are unlikely to change for the same reason. We employ a simple purity measure of neighbours to selectively update nodes that are on the borders of their communities. In other words, we only update nodes for which the proportion of neighbours sharing the maximal label is less than a certain percentage. Indeed, small-degree nodes are likely to be avoided in early iterations in this setting but their contributions to the overall community structure and performance are almost insignificant. We carry out the modified algorithm with thresholds set at 100% (equivalent to the unmodified algorithm), 80%, 60% and 40% to examine the trade-off between accuracy and speed.
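The purity test used to decide whether a node is worth updating can be sketched as follows. The function name `needs_update` is hypothetical, and the comparison is written as "less than or equal to" so that a threshold of 100% updates every non-isolated node and thus reproduces the unmodified algorithm, as the text requires; a strict comparison is an equally plausible reading of the description.

```python
from collections import Counter

def needs_update(i, adj, labels, threshold=0.8):
    """Return True if node i should be re-examined in this iteration.

    The purity of i is the fraction of its neighbours carrying the maximal
    (most frequent) neighbourhood label. Nodes whose purity exceeds the
    threshold are treated as lying well inside a community and are skipped.
    """
    neighbours = adj[i]
    if not neighbours:
        return False                      # isolated nodes never change label
    counts = Counter(labels[j] for j in neighbours)
    purity = max(counts.values()) / len(neighbours)
    return purity <= threshold
```

In a full implementation this check would guard the label update inside the main loop, so that the cost of an iteration shrinks as more nodes settle deep inside their communities.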

Fig. 10 reveals that after the first iteration, the extra constraint increasingly avoids updating nodes. As more nodes settle into stable clusters, less and less time is required per iteration. Interestingly, even with a threshold as low as 40%, the absolute difference in modularity compared to the original setting is reasonably small, and we can see that the overall running time can be significantly reduced.



D. Parallel & Online Analysis 

Clear advantages of label propagation include the ease with which it can be parallelized and its potential online implementation in real-time networks. Since each node is required only to know information about its neighbours and updates itself according to the common rules, parallelism can be easily achieved. This brings us to another technical point: when the algorithm is completely parallelized, even without explicit synchronization, it would tend to behave like the synchronous version of the algorithm. This is the key reason why we have stressed in this paper that improving both the synchronous and asynchronous versions of the algorithm is equally important.

The running time in a parallel environment effectively reduces to k if there are on the order of n machines. This can be achieved in real-world ubiquitous systems such as a mobile ad-hoc network (MANET) or potentially on OSNs themselves (if members are willing to contribute their computational power) in real time. For instance, social information such as the community structure is known to benefit routing in MANETs [21]. Moreover, in such scenarios the space requirement for storing link information would become decentralised and thus insignificant.

On the same note, we see great potential in adapting the algorithm for online community detection in real-time dynamic networks where the presence of nodes and edges is constantly evolving. The microscopic movements and intermittent presence of nodes contribute to changes in the weights of the edges. These in turn result in five distinct macroscopic behaviours of communities, namely: growth, shrinkage, union, division and death. The challenge is to detect local changes without the need for a global update, given limited computational resources or time constraints. We believe label propagation is particularly suited to this paradigm and thus propose this as future work.



V. COMPARISONS

We first look at two relatively large and previously studied networks for comparisons. These networks are respectively the Amazon purchasing network analysed in [13] and the actor collaboration network [2^ . As done in [13], we assume all edges to be undirected to ease the analysis. With the added heuristics, the algorithm is able to perform within 5% of CNM and 10% of the adaptation by Danon, Díaz-Guilera and Arenas (CNM-DDA) in terms of modularity (cf. Table I). LPA, however, achieves the result in a matter of minutes, which is unparalleled by the above.

For a more standardized comparison, we turn to the recently proposed benchmark graphs by Lancichinetti et al. [19], an extension to the well-known GN benchmark which incorporates more realistic scale-free degree and cluster-size distributions. We follow closely the implementation of the benchmark graphs as described in [19] and compare the original LPA with the improved version on graphs of size 1000 and 5000. To contrast label propagation with general fast modularity maximisation algorithms, we also run the benchmarks on the CNM algorithm.

As shown in Fig. 11, both implementations achieve superior accuracy over CNM in terms of normalised mutual information (NMI), even up to a mixing parameter of 0.6. Interestingly, the original method shows signs of failure at μ = 0.5 in the N = 1000, d = 50 benchmark graphs (cf. Fig. 11(b)). We believe this corresponds to the formation of monster communities discussed in Section IV A. The number of nodes and the average degree of the benchmark graphs in effect dictate the number and sizes of the original communities generated. The results hence point out that denser and less modularized graphs are relatively prone to the formation of monster communities. However, the application of hop attenuation as exemplified in Fig. 11(b) greatly improves the overall performance of LPA in such scenarios.

Importantly, as opposed to label propagation, we can see that the CNM algorithm's performance does not merely depend on the mixing parameter but also on the average degree of the network. The resolution limit of modularity maximization is reflected by CNM's worse performance in graphs having a smaller average degree. Although in most configurations all algorithms expectedly manage to uncover a modularity value of a similar magnitude, the real accuracy in terms of NMI does not follow. This finding corresponds to the notion in [17] that modularity maximisation does not simply translate to actual communities.



VI. CONCLUSIONS



In this paper, we have empirically analysed a scalable, efficient and accurate community detection algorithm. We discussed the behaviours and emphasized the importance of both the synchronous and asynchronous implementations of the algorithm. We suggested potential heuristics that can be applied to improve its average detection performance and adaptability. Most importantly, we contrasted the algorithm with modularity-gain based methods in terms of community detection accuracy and observed how it can potentially be applied online and concurrently in large-scale and real-time dynamic networks.

Understanding the dynamics of this algorithm would be the major future work before one devises further heuristics to improve the algorithm. We believe that each notion discussed in Section IV is worthy of further inspection. An equally important point is to analyse, mathematically or empirically, how best to adapt the algorithm to different types of networks using the added heuristics. How do different network topologies and models affect the algorithm's convergent behaviour? These are all valuable questions to be investigated in future work.

In summary, we show that label propagation with the appropriate modifications is a more reliable and efficient method in detecting communities in large-scale networks than popular existing methods. We trust that with further understanding and analysis epidemic-based community detection would be of substantial value to the field.

Network                     Size      Directed Links   Q (Claimed)             Peak Q (Sync.)   Peak Q (Async.)
Amazon Purchase (Mar'03)    409,687   4,929,260        0.745 [13]              0.724            0.727
Actor Collaboration         374,511   30,052,912       0.528 [4], 0.719 [14]   0.642            0.660

TABLE I: The results correspond to the peak modularity achieved in 10 iterations or less, with f = Deg and m = 0.1 and a gradually decreasing δ as discussed in Section IV B.
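The NMI accuracy measure used for the benchmark comparisons in Section V can be sketched as below. The text does not spell out the normalisation, so this sketch assumes the common form 2 I(A;B) / (H(A) + H(B)), which equals 1 for identical partitions and 0 for independent ones; the function name `nmi` and the dictionary inputs are illustrative.

```python
from collections import Counter
from math import log

def nmi(found, truth):
    """Normalised mutual information between two partitions of the same nodes.

    found, truth: dicts mapping node -> community id.
    Assumes the normalisation 2 I(A;B) / (H(A) + H(B)).
    """
    nodes = list(truth)
    n = len(nodes)
    count_a = Counter(found[v] for v in nodes)
    count_b = Counter(truth[v] for v in nodes)
    joint = Counter((found[v], truth[v]) for v in nodes)

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0   # both partitions place every node in a single community
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    return 2 * mutual / (h_a + h_b)
```

The measure depends only on how nodes are grouped, not on the community identifiers, so a detected partition can be compared with the planted one without relabelling.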

[Fig. 11 plot residue. Legend: CNM, LPA, LPA-δ. Horizontal axis: mixing parameter μ, from 0.1 to 0.6. Surviving panel titles: (c) N = 5000, d = 15; (d) N = 5000, d = 50.]

FIG. 11: Average performance comparisons between the three algorithms in the benchmark graphs with size N and average degree d. Both versions of LPA here are asynchronous; LPA-δ implements a gradually decreasing δ as discussed in Section IV B. All benchmark graphs have power-law degree and cluster-size distributions with exponents 3 and 2. For N = 1000, the results are the average over 100 realisations; for N = 5000, over 10 realisations.